Query Rewrite With Auxiliary Attributes In Query Processing Operations

ABSTRACT

Techniques are provided for rewriting queries during a database query processing operation to include auxiliary attributes not included in the original query, thus improving processing efficiency. For example, a technique for rewriting a query in a query processing operation includes the following steps. The query is processed in accordance with at least a portion of a data set, producing query results. Data attributes from the query results are analyzed. At least one new predicate from at least one auxiliary data attribute is appended on the query.

CROSS-REFERENCE TO RELATED APPLICATION(S)

This application is a continuation of U.S. application Ser. No.11/852,705, filed on Sep. 10, 2007, which is a continuation of U.S.application Ser. No. 10/668,549, filed on Sep. 23, 2003 now issued asU.S. Pat. No. 7,587,394, the disclosures of which are incorporatedherein by reference.

FIELD

The field is related to query processing operations in databases andother information systems and, more particularly, to rewriting queriesduring database query processing operations.

BACKGROUND

Due to an increase in the amount of online business transactions,efficiency in the query processing of large amounts of data stored indatabases, data warehouses, and other information systems, has becomeincreasingly important. Since the late 1970s, research and developmentactivities have persisted throughout academia and industry in studyingefficient query processing operations in standalone databases,distributed databases, hierarchical databases, spatial databases, anddatabases on the World Wide Web. Comprehensive reviews of reported queryprocessing techniques have been provided in several sources, see, e.g.,G. Graefe, “Query Evaluation Techniques for Large Databases,” ACMComputing Surveys, Volume 25 Issue 2, June 1993; and D. Kossmann, “TheState of the Art in Distributed Query Processing,” ACM ComputingSurveys, Volume 32 Issue 4, December 2000.

Database query processing operations typically consist of six phases:parsing, semantic checking, query rewrite, plan optimization, planrefinement, and query evaluation. The reasons for rewriting a query inquery processing operations have been described in several references,see, e.g., L. M. Haas et al., “Extensible Query Processing inStarburst,” ACM SIGMOD Record, Volume 18 Issue 2, June 1989. Somereasons for query rewrite include:

1. Alternative phrasing of a query: Since relational query languages arenonprocedural, the alternatives perform equivalently; and

2. Involvement of view abstractions in a query: Since view definitionsare hidden from the query writer, a query involving views can only beexpanded by the database.

Traditional query rewrite strategies focus on rewrite and optimizationfor attributes that appear in query predicates, see, e.g., G. Graefe etal., “Dynamic Query Evaluation Plans,” ACM SIGMOD Record, Volume 18Issue 2, June 1989; and D. Kossmann et al., “Iterative DynamicProgramming: A New Class of Query Optimization Algorithms,” ACMTransactions on Database Systems, Volume 25 Issue 1, March 2000.

SUMMARY

Embodiments of the invention provide techniques for rewriting queriesduring query processing operations.

For example, in one embodiment of the invention, a technique forrewriting a query during a database query processing operation includesthe following steps. First, the query is processed in accordance with atleast a portion of a data set, producing query results. Second, dataattributes from the query results are analyzed. Finally, a new predicatefrom a data attribute is appended on the query. Thus, advantageously anew predicate relating to a data attribute not included in the originalquery may be appended on the query, thus improving processingefficiency.

Several major differences exist between the principles of the inventionand traditional methodologies. For example, principles of the invention:(i) may advocate a query rewrite strategy to append new predicates basedon attributes absent from the original query; (ii) may employ a two-passquery processing operation approach as opposed to the prevalentsingle-pass processing; (iii) may employ statistical analysis and otherdata mining techniques to help identify distinguishing attributes; and(iv) may have false dismissals due to query rewrite, while traditionalmethodologies emphasize exact answers.

Thus, embodiments of the invention propose at least one additionalreason for query rewrite that is distinct from traditional querynormalization approaches. This distinct reason is to leverage theknowledge about auxiliary attributes in data records to appendadditional query predicates for more efficient processing.Advantageously, the inventive technique may be useful for providingefficient query processing operations in large databases, whilemaintaining the accuracy of results. This efficiency translates anormally high processing cost query into a low cost query.

These and other objects, features, and advantages will become apparentfrom the following detailed description of illustrative embodimentsthereof, which is to be read in connection with the accompanyingdrawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a table illustrating electricity usage data stored in adatabase, for use in explaining distinctions between embodiments of theinvention and traditional methodologies;

FIG. 2 is a flow diagram illustrating a first pass query executionmethodology, according to an embodiment of the present invention;

FIG. 3 is a flow diagram illustrating a data sampling methodology,according to an embodiment of the present invention;

FIG. 4 is a flow diagram illustrating a selective attribute selectionmethodology, according to an embodiment of the present invention;

FIG. 5 is a flow diagram illustrating a query rewrite methodology,according to an embodiment of the present invention; and

FIG. 6 is a diagram illustrating an illustrative hardware implementationof a computing system in accordance with which one or morecomponents/methodologies of the present invention may be implemented,according to an embodiment of the present invention.

DETAILED DESCRIPTION

The following description will illustrate principles of the inventionusing an exemplary data processing system architecture. Principles ofthe invention are not limited to use with any particular systemarchitecture. Instead it is more generally applicable to any dataprocessing system in which it is desirable to perform efficient andeffective query processing operations.

The term “database,” as used herein, is intended to include any datastorage software and systems, such as, for example, a relationaldatabase like IBM DB2 and Oracle 9. The term “data record,” as usedherein, is intended to include any logical structured association ofdata in a database, such as, for example, a row in a table of arelational database. The terms “attribute” and “data attribute,” as usedherein, are intended to include any element of a data record, such as,for example, a column value of a row. The term “target attribute,” asused herein, is intended to refer to any attribute that explicitlyappears in a user query. The term “auxiliary attribute,” as used herein,is intended to refer to any attribute that is missing from a user query.

Principles of the invention are applicable to query processingoperations on a database or other information system that storesstructured data records with one or more attributes. For ease ofillustration and description, an example based on query processingoperations on a database table is used in the preferred embodiment ofthe present invention. The present invention should not be viewed aslimited or constrained to processing on database tables.

Referring initially to FIG. 1, a table illustrates electricity usagedata stored in a database, for use in explaining distinctions betweenthe principles of the invention and traditional methodologies. The tablehas four attribute labels in its header row; reporting date 100,Westchester electricity usage 102, New York City electricity usage 104,and temperature 106. An exemplary query 108 asks for dates having atotal daily electricity usage above 3000 MegaWatt for these twogeographic areas. In this instance, the total daily electricity usage iscomputed by adding the daily electricity usage of New York City (NYC) tothe daily electricity usage of Westchester County, N.Y. Thus, the querystatement has a single predicate regarding the sum of the values forWestchester electricity usage 102 and NYC electricity usage 104 beinggreater than 3000 MW for a reporting date 100.

Traditional query rewrite focuses on attribute columns, such asreporting date 100, Westchester electricity usage 102, and NYCelectricity usage 104. However, through the analysis of attribute valuedistributions, it may be discovered that the total electricity usage isabove 3000 MW only when the outdoor temperature is above 90 EF. Thetemperature attribute, which is not specified in the original query, isauxiliary, but can significantly reduce query processing time by addingan additional predicate to the query. It would thus be beneficial toinclude the temperature predicate at the query rewrite phase. However,traditional query rewrite strategies have not included predicates onauxiliary attributes as proposed by principles of the invention.

The query rewrite methodology proposed in the present invention includestwo passes. The first pass discovers attributes that have a highselectivity from an analysis of query results from sampled data. Thesecond pass selects top-ranked auxiliary attributes and appends newpredicates relating to the selected attributes to the original query forimproved query efficiency.

Referring now to FIG. 2, a flow diagram illustrates a first pass queryexecution methodology, according to an embodiment of the presentinvention. The methodology begins by inputting a user query 200. Atraditional, unmodified query processing operation is first used toexecute user query 200 on sampled data 204 from a dataset in step 202.Results of user query 200 are returned as selected data records 206.Statistics for each attribute value of selected data records 206 areextracted in step 208. Separately, statistics for identical attributesare collected and extracted from sampled data 204 in step 210. The setsof statistics from steps 208 and 210, are evaluated in step 212. Thisevaluation is described in further detail in FIG. 4.

In accordance with principles of the invention, for numericalattributes, typical statistics extracted in steps 208 and 210 mayinclude a maximum, a minimum, a median, an average, and a standarddeviation. For categorical attributes, the extracted statistics in steps208 and 210 may include the number of distinct values and the aggregatedcounts for each value. Higher order statistics involving jointstatistics of two or more attributes can also be extracted. In thisembodiment, the present invention applies the maximum and minimum valuesof numerical attributes but its scope is not limited to theaforementioned statistics.

Referring now to FIG. 3, a flow diagram illustrates a data samplingmethodology, according to an embodiment of the present invention. Thiscan be considered a detailed description of the acquisition of sampleddata 204 in FIG. 2. This methodology reduces the number of data recordsfor the first-pass query processing operation of FIG. 2. Every Nthrecord of a full data set 300 is sampled in step 302 resulting in asampled data set 304. N is typically in the range of 10 to 100. Theresulting records of sampled data set 304 are stored in a smallertemporary database.

Referring now to FIG. 4, a flow diagram illustrates a selectiveattribute selection methodology, according to an embodiment of thepresent invention. This methodology may be considered a detaileddescription of step 212 in FIG. 2. The methodology illustrates apreferred evaluation method to identify selective attributes. However,principles of the invention are not limited to this particularevaluation and is applicable to other evaluation criteria. In FIG. 4,each numerical attribute, input in step 400, is examined by comparingthe dynamic ranges of attribute statistics from sampled data toattribute statistics from query results. Both target and auxiliaryattributes are examined. The dynamic range of an attribute may bedefined as the difference between the maximum and minimum values. Step402 computes a ratio R for the two dynamic ranges. In step 404, it isdetermined whether R is smaller than 10%. If R is smaller than 10% theattribute is selective 408. Otherwise, the attribute is non-selective406. The threshold for R is preferably at or below 10%.

Referring now to FIG. 5, a flow diagram illustrates a query rewritemethodology and second query pass, according to an embodiment of thepresent invention. The methodology begins at step 500 where the relativeselectivity for each selective attribute is evaluated. These selectiveattributes 408 were discovered in step 404 of FIG. 4. The second querypass begins by selecting high selectivity attributes in step 502. Theremay be multiple selective attributes, which are ranked in ascendingorder by the value of R. A small R value is preferred because itindicates higher selectivity and higher efficiency. Depending onspecific query scenarios, one or more top-ranked attributes areselected. Range predicates are then formed in step 504. The dynamicranges of selected attributes from query results form new rangepredicates by bounding the minimum and maximum values. For example, thetemperature attribute in FIG. 1 has a maximum value 95 and a minimumvalue 91 for dates whose total electric usage is above 3000 MW. Therange predicate of temperature becomes [91, 95]. User query 508 is thenappended with new range (temperature) predicates in step 506. Themethodology terminates with the resulting modified query 510, such as,for example, “Select date from table where Westchester+NewYorkCity >3000AND temperature <=95 and temperature >=91.” The newly appended query 510is then passed to the traditional query processing operation andoptimization, which will use the newly appended, highly selectivepredicates to reduce the processing time.

In accordance with the present invention, the appended query may notyield the same set of results as the original query. This is because thesampling step in FIG. 3 may miss outliers in the data records which arenot included in the attribute selection and predicate construction.However, an appended query and its original query yield identicalresults in many practical applications, such as in the electricity usagequery example. The samples in those applications usually have less noiseand smoother variations.

Referring now to FIG. 6, a block diagram illustrates an illustrativehardware implementation of a computing system in accordance with whichone or more components/methodologies of the invention (e.g.,components/methodologies described in the context of FIGS. 1 through 5)may be implemented, according to an embodiment of the present invention.For instance, such a computing system in FIG. 6 may implement a databaseand process user query 204 of FIG. 2.

It is to be understood that such individual components/methodologies maybe implemented and processed on one such computer system, or on morethan one such computer system. For instance, user query 204 may beprocessed on one computer system (e.g., client device), while thedatabase may be implemented on another computer system. In the case ofan implementation in a distributed computing system, the individualcomputer systems and/or devices may be connected via a suitable network,e.g., the Internet or World Wide Web. However, the system may berealized via private or local networks. The invention is not limited toany particular network.

As shown, the computer system may be implemented in accordance with aprocessor 610, a memory 612, I/O devices 614, and a network interface616, coupled viaa computer bus 618 or alternate connection arrangement.

It is to be appreciated that the term “processor” as used herein isintended to include any processing device, such as, for example, onethat includes a CPU (central processing unit) and/or other processingcircuitry. It is also to be understood that the term “processor” mayrefer to more than one processing device and that various elementsassociated with a processing device may be shared by other processingdevices.

The term “memory” as used herein is intended to include memoryassociated with a processor or CPU, such as, for example, RAM, ROM, afixed memory device (e.g., hard drive), a removable memory device (e.g.,diskette), flash memory, etc.

In addition, the phrase “input/output devices” or “I/O devices” as usedherein is intended to include, for example, one or more input devices(e.g., keyboard, mouse, etc.) for entering data to the processing unit,and/or one or more output devices (e.g., speaker, display, etc.) forpresenting results associated with the processing unit.

Still further, the phrase “network interface” as used herein is intendedto include, for example, one or more transceivers to permit the computersystem to communicate with another computer system via an appropriatecommunications protocol (e.g., HTTP/S). Accordingly, software componentsincluding instructions or code for performing the methodologiesdescribed herein may be stored in one or more of the associated memorydevices (e.g., ROM, fixed or removable memory) and, when ready to beutilized, loaded in part or in whole (e.g., into RAM) and executed by aCPU.

Although illustrative embodiments of the present invention have beendescribed herein with reference to the accompanying drawings, theinvention is not limited to those precise embodiments, and that variousother changes and modifications may be made by one skilled in the artwithout departing from the scope or spirit of the invention.

1. Apparatus for rewriting a query during a database query processingoperation, comprising: a memory; and at least one processor coupled tothe memory and operative to: (i) produce query results comprising one ormore target attributes and one or more auxiliary attributes by executingthe query on at least a portion of a data set, wherein the one or moretarget attributes are included in the query and the one or moreauxiliary attributes are not included in the query; (ii) determine arelative selectivity for the one or more auxiliary attributes bycomparing the query results to at least the portion of the data set fromwhich the query results were produced; and (iii) append the query withat least one new predicate corresponding to at least one of the one ormore auxiliary attributes having a high relative selectivity to form arewritten query.
 2. The apparatus of claim 1, wherein the query resultsare produced by executing the query on records sampled from the dataset.
 3. The apparatus of claim 2, wherein the processor is furtheroperative to sample the records from the data set.
 4. The apparatus ofclaim 3, wherein the operation of sampling the records comprisessampling every Nth record from the data set, wherein N is a positiveinteger greater than or equal to two.
 5. The apparatus of claim 2,wherein the at least one processor is further operative to execute therewritten query on the data set from which the records were sampled. 6.The apparatus of claim 1, wherein the operation of comparing the queryresults to at least the portion of the data set comprises: extractingstatistics for the one or more auxiliary attributes from the queryresults; extracting statistics for the one or more auxiliary attributesfrom at least the portion of the data set; and evaluating the relativeselectivity for the one or more auxiliary attributes by comparing theextracted statistics from the query results to the extracted statisticsfrom at least the portion of the data set.
 7. The apparatus of claim 1,wherein the operation of determining the relative selectivity comprises:comparing a range of values in the query results for a given one of theone or more auxiliary attributes to a range of values in at least theportion of the data set for the given one of the one or more auxiliaryattributes.
 8. The apparatus of claim 7, wherein the operation ofdetermining the relative selectivity further comprises comparing a ratioof the ranges to a predetermined value.
 9. The apparatus of claim 1,wherein the operation of appending the query with at least one newpredicate comprises: selecting at least one auxiliary data attributewith a high relative selectivity; forming at least one new predicatecorresponding to the selected at least one auxiliary attribute; andappending the query with the at least one new predicate.
 10. Theapparatus of claim 1, wherein the at least one processor is furtheroperative to execute the rewritten query on the data set.
 11. Theapparatus of claim 1, wherein the operation of appending the query withat least one new predicate comprises: selecting the at least one of theone or more auxiliary attributes based at least in part on a ranking ofthe determined relative selectivity for the one or more auxiliaryattributes; and appending the query with at least one new predicatecorresponding to the selected at least one of the one or more auxiliaryattributes to form the rewritten query.
 12. A method for rewriting aquery during a database query processing operation, comprising the stepsof: producing query results comprising one or more target attributes andone or more auxiliary attributes by executing the query on at least aportion of a data set, wherein the one or more target attributes areincluded in the query and the one or more auxiliary attributes are notincluded in the query; determining a relative selectivity for the one ormore attributes by comparing the query results to at least the portionof the data set from which the query results were produced; andappending the query with at least one new predicate corresponding to atleast one of the one or more auxiliary attributes having a high relativeselectivity to form a rewritten query.
 13. The method of claim 12,wherein the query results are produced by executing the query on recordssampled from the data set.
 14. The method of claim 13, furthercomprising the step of sampling the records from the data set.
 15. Themethod of claim 14, wherein the step of sampling the records comprisessampling every Nth record from the data set, wherein N is a positiveinteger greater than or equal to two.
 16. The method of claim 12,wherein the step of comparing the query results to at least the portionof the data set comprises: extracting statistics for the one or moreauxiliary attributes from the query results; extracting statistics forthe one or more auxiliary attributes from at least the portion of thedata set; and evaluating the relative selectivity for the one or moreauxiliary attributes by comparing the extracted statistics from thequery results to the extracted statistics from at least the portion ofthe data set.
 17. The method of claim 12, wherein the step ofdetermining the relative selectivity comprises: comparing a range ofvalues in the query results for a given one of the one or more auxiliaryattributes to a range of values in at least the portion of the data setfor the given one of the one or more auxiliary attributes; and comparinga ratio of the ranges to a predetermined value.
 18. The method of claim12, further comprising the step of executing the rewritten query on thedata set.
 19. The method of claim 12, wherein the step of appending thequery with at least one new predicate comprises: selecting at least oneof the one or more auxiliary attributes based at least in part on aranking of the determined relative selectivity for the one or moreauxiliary attributes; and appending the query with at least one newpredicate corresponding to the selected at least one of the one or moreauxiliary attributes to form the rewritten query.
 20. An article ofmanufacture for rewriting a query during a database query processingoperation; comprising a machine readable storage medium containing oneor more programs which when executed implement the steps of: producingquery results comprising one or more target attributes and one or moreauxiliary attributes by executing the query on at least a portion of thedata set, wherein the one or more target attributes are included in thequery and the one or more auxiliary attributes are not included in thequery; determining a relative selectivity for the one or more auxiliaryattributes by comparing the query results to at least the portion of thedata set from which the query results were produced; and appending thequery with at least one new predicate corresponding to at least one ofthe one or more auxiliary attributes having a high relative selectivityto form a rewritten query.